 double oracle algorithm


XDO: A Double Oracle Algorithm for Extensive-Form Games

Neural Information Processing Systems

Policy Space Response Oracles (PSRO) is a reinforcement learning (RL) algorithm for two-player zero-sum games that has been empirically shown to find approximate Nash equilibria in large games. Although PSRO is guaranteed to converge to an approximate Nash equilibrium and can handle continuous actions, it may take an exponential number of iterations as the number of information states (infostates) grows. We propose Extensive-Form Double Oracle (XDO), an extensive-form double oracle algorithm for two-player zero-sum games that is guaranteed to converge to an approximate Nash equilibrium linearly in the number of infostates. Unlike PSRO, which mixes best responses at the root of the game, XDO mixes best responses at every infostate. We also introduce Neural XDO (NXDO), where the best response is learned through deep RL. In tabular experiments on Leduc poker, we find that XDO achieves an approximate Nash equilibrium in a number of iterations an order of magnitude smaller than PSRO. Experiments on a modified Leduc poker game and Oshi-Zumo show that tabular XDO achieves a lower exploitability than CFR with the same amount of computation. We also find that NXDO outperforms PSRO and NFSP on a sequential multidimensional continuous-action game. NXDO is the first deep RL method that can find an approximate Nash equilibrium in high-dimensional continuous-action sequential games.


A Ablations

Neural Information Processing Systems

We find that past play greatly stabilizes the emergence of reciprocity in IPD. In cells containing another agent, we include the RUSP observations in these channels. In Figure 11 we show results when training with RUSP in these environments. Consistent with past work, the greedy baseline fails to reach a solution with high collective return. We use a distributed computing infrastructure used in Berner et al.


XDO: A Double Oracle Algorithm for Extensive-Form Games

McAleer, Stephen, Lanier, John, Baldi, Pierre, Fox, Roy

arXiv.org Artificial Intelligence

Policy Space Response Oracles (PSRO) is a deep reinforcement learning algorithm for two-player zero-sum games that has empirically found approximate Nash equilibria in large games. Although PSRO is guaranteed to converge to a Nash equilibrium, it may take an exponential number of iterations as the number of infostates grows. We propose Extensive-Form Double Oracle (XDO), an extensive-form double oracle algorithm that is guaranteed to converge to an approximate Nash equilibrium linearly in the number of infostates. Unlike PSRO, which mixes best responses at the root of the game, XDO mixes best responses at every infostate. We also introduce Neural XDO (NXDO), where the best response is learned through deep RL. In tabular experiments on Leduc poker, we find that XDO achieves an approximate Nash equilibrium in a number of iterations 1-2 orders of magnitude smaller than PSRO. In experiments on a modified Leduc poker game, we show that tabular XDO achieves over 11x lower exploitability than CFR and over 82x lower exploitability than PSRO and XFP in the same amount of time. We also show that NXDO beats PSRO and is competitive with NFSP on a large no-limit poker game.


A Unified Game-Theoretic Approach to Multi-agent Reinforcement Learning

#artificialintelligence

Today we will dig into the paper A Unified Game-Theoretic Approach to Multi-agent Reinforcement Learning, the source of one of the core ideas used in the development of #AlphaStar. Several concepts in AlphaStar won't be covered here. The aim is to dig into the concepts behind what has been described as the "Nash League" and into how game theory came to mix with reinforcement learning. By the end of this article you should have a notion of the Double Oracle algorithm, Deep Cognitive Hierarchies and Policy-Space Response Oracles. For this post you should be familiar with some game theory concepts, such as setting up a strategic game as a payoff matrix, Nash equilibria and best responses.


Security Games for Controlling Contagion

Tsai, Jason (University of Southern California) | Nguyen, Thanh H. (University of Southern California) | Tambe, Milind (University of Southern California)

AAAI Conferences

Many strategic actions carry a ‘contagious’ component beyond the immediate locale of the effort itself. Viral marketing and peacekeeping operations have both been observed to have a spreading effect. In this work, we use counterinsurgency as our illustrative domain. Defined as the effort to block the spread of support for an insurgency, such operations lack the manpower to defend the entire population and must focus on the opinions of a subset of local leaders. As past researchers of security resource allocation have done, we propose using game theory to develop such policies and model the interconnected network of leaders as a graph. Unlike this past work in security games, actions in these domains possess a probabilistic, non-local impact. To address this new class of security games, we combine recent research in influence blocking maximization with a double oracle approach and create novel heuristic oracles to generate mixed strategies for a real-world leadership network from Afghanistan, synthetic leadership networks, and a real social network. We find that leadership networks that exhibit highly interconnected clusters can be solved equally well by our heuristic methods, but our more sophisticated heuristics outperform simpler ones in less interconnected social networks.


Multi-Step Multi-Sensor Hider-Seeker Games

Halvorson, Erik Daniel (Duke University) | Conitzer, Vincent (Duke University) | Parr, Ronald (Duke University)

AAAI Conferences

We study a multi-step hider-seeker game where the hider is moving on a graph and, in each step, the seeker is able to search c subsets of the graph nodes. We model this game as a zero-sum Bayesian game, which can be solved in weakly polynomial time in the players' action spaces. The seeker's action space is exponential in c, and both players' action spaces are exponential in the game horizon. To manage this intractability, we use a column/constraint generation approach for both players. This approach requires an oracle to determine best responses for each player. However, we show that computing a best response for the seeker is NP-hard, even for a single-step game when c is part of the input, and that computing a best response is NP-hard for both players for the multi-step game, even if c = 1. An integer programming formulation of the best response for the hider is practical for moderate horizons, but computing an exact seeker best response is impractical due to the exponential dependence on both c and the horizon. We therefore develop an approximate best response oracle with bounded suboptimality for the seeker. We prove performance bounds on the strategy that results when column/constraint generation with approximate best responses converges, and we measure the performance of our algorithm in simulations. In our experimental results, column/constraint generation converges to near-minimax strategies for both players fairly quickly.
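
The column/constraint generation loop with best-response oracles that the abstract above describes can be illustrated on a small matrix game. The following is a minimal sketch under stated assumptions, not the method of any paper listed here: the function names (`hedge_solve`, `best_responses`, `double_oracle`) are hypothetical, the restricted game is solved approximately with Hedge (multiplicative weights) self-play instead of the linear program a typical implementation would use, the oracles enumerate all pure strategies, and the loop stops when neither oracle returns a new strategy.

```python
import math


def softmax(scores):
    """Turn a dict of scores into a probability distribution (numerically stable)."""
    top = max(scores.values())
    exps = {k: math.exp(v - top) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def hedge_solve(payoff, rows, cols, iters=5000):
    """Approximately solve the restricted zero-sum game via Hedge self-play.

    payoff[i][j] is the row player's payoff; the column player receives its negative.
    Returns time-averaged mixed strategies as {strategy index: probability}.
    """
    values = [payoff[i][j] for i in rows for j in cols]
    span = (max(values) - min(values)) or 1.0  # avoid dividing by zero
    n = max(len(rows), len(cols), 2)
    eta = math.sqrt(8.0 * math.log(n) / iters) / span  # standard Hedge step size
    row_gain = {i: 0.0 for i in rows}  # cumulative payoff of each row strategy
    col_loss = {j: 0.0 for j in cols}  # cumulative payoff conceded by each column
    row_avg = {i: 0.0 for i in rows}
    col_avg = {j: 0.0 for j in cols}
    for _ in range(iters):
        x = softmax({i: eta * g for i, g in row_gain.items()})
        y = softmax({j: -eta * l for j, l in col_loss.items()})
        for i in rows:
            row_gain[i] += sum(payoff[i][j] * y[j] for j in cols)
            row_avg[i] += x[i] / iters
        for j in cols:
            col_loss[j] += sum(payoff[i][j] * x[i] for i in rows)
            col_avg[j] += y[j] / iters
    return row_avg, col_avg


def best_responses(payoff, row_mix, col_mix):
    """Oracles: best pure strategy in the FULL game against each restricted mixture."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    row_vals = [sum(payoff[i][j] * p for j, p in col_mix.items()) for i in range(n_rows)]
    col_vals = [sum(payoff[i][j] * p for i, p in row_mix.items()) for j in range(n_cols)]
    br_row = max(range(n_rows), key=row_vals.__getitem__)
    br_col = min(range(n_cols), key=col_vals.__getitem__)
    # Exploitability gap: what the two best responders gain in total.
    return br_row, br_col, row_vals[br_row] - col_vals[br_col]


def double_oracle(payoff, max_rounds=100):
    """Grow both strategy sets with best responses until neither oracle adds one."""
    rows, cols = [0], [0]  # start each player with a single arbitrary strategy
    for _ in range(max_rounds):
        row_mix, col_mix = hedge_solve(payoff, rows, cols)
        br_row, br_col, gap = best_responses(payoff, row_mix, col_mix)
        grew = False
        if br_row not in rows:
            rows.append(br_row)
            grew = True
        if br_col not in cols:
            cols.append(br_col)
            grew = True
        if not grew:
            break
    return rows, cols, row_mix, col_mix, gap


# Rock-paper-scissors from the row player's point of view.
rps = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
rows, cols, row_mix, col_mix, gap = double_oracle(rps)
print(sorted(rows), sorted(cols), gap)
```

On rock-paper-scissors every strategy ends up in the restricted game, so the sketch shows the mechanics, not the savings. The approach pays off when the equilibrium support is much smaller than the full strategy space and the oracle can find a best response without enumerating every strategy, as in the integer-programming and approximate oracles mentioned above.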